Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area, California. Beginning operation in August 2013 as Bay Area Bike Share, the Ford GoBike system currently has over 2,600 bicycles in 262 stations across San Francisco, East Bay and San Jose. On June 28, 2017, the system officially launched as Ford GoBike in a partnership with Ford Motor Company.
Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
import zipfile
import requests
import glob
import geopy.distance
from sklearn.cluster import KMeans
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set_palette("viridis")
Source: https://www.lyft.com/bikes/bay-wheels/system-data
The original link provided in the project (https://www.fordgobike.com/system-data) points to the above link.
File versions: 01/2018 - 04/2019
More data files are available for the remaining months of 2019, but they are not used: they differ in file naming, contain additional fields, etc., and would require substantial modification to combine with the rest of the data.
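Had we wanted to include the later files, one hedged way to harmonize differing schemas is a rename map applied on load. The sketch below is purely illustrative — the column names in `COLUMN_RENAMES` are made up, and the real 2019 schemas would need inspection first:

```python
import io
import pandas as pd

# purely illustrative rename map -- the actual 2019 column names would need verification
COLUMN_RENAMES = {
    "start_lat": "start_station_latitude",
    "start_lng": "start_station_longitude",
}

def load_normalized(path_or_buffer):
    """Read one monthly file and rename any known variant columns."""
    frame = pd.read_csv(path_or_buffer)
    return frame.rename(columns=COLUMN_RENAMES)

# example with an in-memory file standing in for a monthly csv
sample = io.StringIO("start_lat,start_lng\n37.77,-122.41\n")
normalized = load_normalized(sample)
```

Extra columns not present in the older files would still need to be dropped or filled before concatenation.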
# list the raw data files and extract any zip archives
files = next(os.walk("./data/raw/"))[2]
for x in files:
    if ".zip" in x:
        with zipfile.ZipFile(f"./data/raw/{x}", 'r') as zip_ref:
            zip_ref.extractall("./data/raw/")
print(files)
# create combined dataset
path = r'./data/raw/'
files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
df.to_csv('./data/processed/dataset.csv', index=False)
Now that we have our combined dataset.csv file, we have to make sure that the data is ready for an analysis.
df = pd.read_csv('./data/processed/dataset.csv')
df.info(null_counts=True)
Here we can already see that the start_time and end_time columns are not datetime objects. The IDs should also be of object type rather than numeric, and the user_type and bike_share_for_all_trip columns can be converted to category.
df.head()
ms.matrix(df)
It seems that we also have a problem with missing data.
df.isna().sum()
We also have some null values in the fields related to the stations.
df.duplicated().sum()
We don't have any duplicates. Let's dig a little deeper.
df.query('start_station_latitude == 0 | start_station_longitude == 0')
We seem to have a test station with 0 latitude and longitude. We drop it so that it doesn't show up as an outlier later on.
index_of_row = df.query(
    'start_station_latitude == 0 | start_station_longitude == 0').index
index_of_row
df.drop(index_of_row, inplace=True)
df.query('start_station_latitude == 0 | start_station_longitude == 0')
df[df.start_station_id.isnull()].head(10)
df[df.start_station_id.isnull()].tail(10)
All the rows with missing data have similar coordinates, and these coordinates are less precise than those of the complete entries.
Let's look at the most frequently occurring coordinates.
df[df.start_station_id.isnull()].start_station_latitude.value_counts()
df[df.start_station_id.isnull()].start_station_longitude.value_counts()
The most frequent coordinates are at 37.41 latitude and -121.94 longitude.
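The lower precision can be made concrete by counting decimal places. A minimal sketch with hypothetical sample values (the real check would run on the actual latitude/longitude columns):

```python
import pandas as pd

# hypothetical sample: two truncated coordinates next to two full-precision ones
coords = pd.Series([37.41, -121.94, 37.77662, -122.39997])

# number of decimal places as a rough proxy for coordinate precision
decimals = coords.abs().astype(str).str.split(".").str[1].str.len()
```

Two decimal places correspond to roughly 1 km of positional uncertainty, which fits the blurry cloud of points seen below.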
fig, ax = plt.subplots(figsize=(10, 5))
sns.scatterplot(data=df[df.start_station_id.isnull()],
                x="start_station_longitude", y="start_station_latitude", alpha=0.15, s=200)
sns.scatterplot(data=df.dropna(subset=["start_station_id"]).sample(50000),
                x="start_station_longitude", y="start_station_latitude", alpha=0.15, s=200)
plt.xlim(-121.8, -122)
plt.ylim(37.3, 37.45)
plt.tight_layout()
We can see that these NaN stations mostly lie outside the range of the other stations.
# check whether every station id appears both as a start and as an end station
start_station_id_list = sorted(
    df.start_station_id.drop_duplicates().dropna().astype("int"))
end_station_id_list = sorted(
    df.end_station_id.drop_duplicates().dropna().astype("int"))
start_station_id_list == end_station_id_list
df_station_names = df[["end_station_id", "end_station_name",
                       "end_station_latitude", "end_station_longitude"]].copy()
df_station_names.rename(columns={"end_station_id": "id",
                                 "end_station_name": "station_name",
                                 "end_station_latitude": "station_latitude",
                                 "end_station_longitude": "station_longitude"}, inplace=True)
df_station_names.drop_duplicates(inplace=True)
df_station_names.dropna(inplace=True)
df_station_names.sort_values("id", inplace=True)
df_station_names.reset_index(inplace=True, drop=True)
df_station_names.drop(
    df_station_names.query('station_latitude == 0 | station_longitude == 0').index, inplace=True)
print(df_station_names)
fig, ax = plt.subplots(figsize=(10, 5))
# sns.scatterplot(data = df[df.start_station_id.isnull()], x = "start_station_longitude", y = "start_station_latitude", alpha = 0.15, s = 200)
sns.scatterplot(data=df_station_names, x="station_longitude",
y="station_latitude", s=200)
# plt.xlim(-122.5, -121.8)
# plt.ylim(37.2, 37.9)
plt.tight_layout()
At first glance we can see three clear clusters in this dataset. The Ford GoBike website distinguishes the following zones: San Francisco, East Bay, San José.
df_station_names.id.value_counts()[df_station_names.id.value_counts() > 1]
Here we can see an interesting thing: One ID can refer to multiple different stations.
df_station_names[df_station_names.duplicated("id", keep=False)]
If we filter out the stations that didn't change position and just got a new name, we can exclude 4 cases. The remaining cases are still relevant to investigate.
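One hedged way to separate pure renames from actual relocations is to keep only duplicated ids whose coordinates never changed. A sketch on a toy table with made-up values mirroring `df_station_names`:

```python
import pandas as pd

# toy station table (hypothetical values) mirroring df_station_names
stations = pd.DataFrame({
    "id": [10, 10, 11, 11],
    "station_name": ["Old Name", "New Name", "Foo St", "Foo St (moved)"],
    "station_latitude": [37.78, 37.78, 37.33, 37.34],
    "station_longitude": [-122.41, -122.41, -121.89, -121.89],
})

dups = stations[stations.duplicated("id", keep=False)]
# ids whose coordinates never changed: pure renames, safe to exclude
renamed_only = dups.groupby("id").filter(
    lambda g: len(g[["station_latitude", "station_longitude"]].drop_duplicates()) == 1)
```

Here id 10 is a pure rename while id 11 actually moved, so only id 10 survives the filter.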
df_consistency = df[["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude"]].copy(
).merge(df_station_names, left_on="end_station_id", right_on="id", how="outer")
df_consistency_issues = df_consistency.drop_duplicates().dropna()
df_stat_dist = df_consistency_issues[(df_consistency_issues.end_station_latitude != df_consistency_issues.station_latitude) &
(df_consistency_issues.end_station_longitude != df_consistency_issues.station_longitude)].copy()
def calculate_coord_dist(col):
    # col holds (end_lat, end_lon, station_lat, station_lon)
    coords_1 = (col[0], col[1])
    coords_2 = (col[2], col[3])
    return geopy.distance.distance(coords_1, coords_2).m

df_stat_dist["dist_in_m"] = df_stat_dist[["end_station_latitude", "end_station_longitude",
                                          "station_latitude", "station_longitude"]].apply(calculate_coord_dist, axis=1)
df_stat_dist_data = df_stat_dist.drop_duplicates(
"dist_in_m").sort_values("dist_in_m", ascending=False)
df_stat_dist_data
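geopy's geodesic distance is used above. As a rough cross-check without the dependency, a spherical haversine sketch could look like the following (expect small deviations from geodesic values, since it assumes a perfectly spherical Earth):

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in meters on a spherical Earth."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_m * np.arcsin(np.sqrt(a))

# roughly 0.9 km for 0.01 degrees of longitude at this latitude
d = haversine_m(37.0, -122.00, 37.0, -122.01)
```

For the sub-kilometer distances in question, the difference between haversine and geodesic results is negligible.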
We can see that the distance between relocated stations ranges from 0 to about 364 meters, with one extreme case of nearly 4 km. This needs to be part of the data cleaning. Now we will look at the overall numeric structure of this dataset.
df.describe().round(3)
Nothing too obvious here. But there seems to be an outlier at the maximum duration_sec.
df.query("duration_sec == 86366")
df.query("duration_sec > 80000")
Trips this long are unusual. Maybe someone forgot to dock their bike, or maybe some users really do rent bikes for this duration.
# there are multiple ways how to handle this - but since the coordinates are not precise we will just drop them
df.dropna(subset=["start_station_id"], inplace=True)
ms.matrix(df)
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])
for col in ["start_station_id", "end_station_id"]:
    df[col] = df[col].astype('int')
for col in ["start_station_id", "end_station_id", "bike_id"]:
    df[col] = df[col].astype('str')
for col in ["user_type", "bike_share_for_all_trip"]:
    df[col] = df[col].astype('category')
df.info(null_counts=True)
We have two options here. The first approach is to simply ignore most of these cases, because the different stations are relatively close to each other; that raises the question of how close is close. The second approach is to give the IDs a new "subindex": every time the coordinates of a station ID differ from its previous occurrence by more than some threshold x, we increase the subindex by 1.
We'll try the second approach.
df.sort_values("start_time", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()
Let's create a function for this.
def get_new_id(col):
    # col holds (id, station_name, latitude, longitude)
    # first occurrence of this id: store name, coordinates and subindex 0
    if col[0] not in id_data:
        id_data[col[0]] = [col[1], col[2], col[3], f"{col[0]}_0"]
        return id_data[col[0]][3]
    # known id: if the coordinates changed by more than 100 m, increase the
    # subindex and replace the saved coordinates; otherwise return the saved id
    if id_data[col[0]][1] != col[2] or id_data[col[0]][2] != col[3]:
        coords_1 = (id_data[col[0]][1], id_data[col[0]][2])
        coords_2 = (col[2], col[3])
        if geopy.distance.distance(coords_1, coords_2).m > 100:
            # split on "_" so subindexes above 9 increment correctly
            new_ind = f"{col[0]}_{int(id_data[col[0]][3].split('_')[-1]) + 1}"
            id_data[col[0]][3] = new_ind
            id_data[col[0]][1] = col[2]
            id_data[col[0]][2] = col[3]
            return new_ind
    return id_data[col[0]][3]
id_data = {}
df["start_station_id_new"] = df[["start_station_id", "start_station_name",
"start_station_latitude", "start_station_longitude"]].apply(get_new_id, axis=1)
id_data = {}
df["end_station_id_new"] = df[["end_station_id", "end_station_name",
"end_station_latitude", "end_station_longitude"]].apply(get_new_id, axis=1)
df_station_names.head()
df_station_names = df[["start_time", "start_station_id", "start_station_name",
"start_station_latitude", "start_station_longitude"]].copy()
df_station_names.sort_values("start_time", inplace=True)
df_station_names.drop("start_time", axis=1, inplace=True)
df_station_names.rename(columns={"start_station_id": "id",
"start_station_name": "station_name",
"start_station_latitude": "station_latitude",
"start_station_longitude": "station_longitude"}, inplace=True)
df_station_names.drop_duplicates(inplace=True)
df_station_names.dropna(inplace=True)
id_data = {}
df_station_names["new_id"] = df_station_names[["id", "station_name",
"station_latitude", "station_longitude"]].apply(get_new_id, axis=1)
df_station_names.sort_values("id", inplace=True)
df_station_names.reset_index(inplace=True, drop=True)
df_station_names.head()
Let's see if it worked.
df.query("start_station_id == '101'").drop_duplicates("start_station_id_new")[
["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude", "start_station_id_new"]]
df.query("end_station_id == '101'").drop_duplicates("end_station_id_new")[
["end_station_id", "end_station_name", "end_station_latitude", "end_station_longitude", "end_station_id_new"]]
Seems like it works. We'll save this new data frame into its own file for later use.
# export the station names to csv
df_station_names.to_csv("./data/processed/df_station_names.csv", index=False)
df_station_names.info()
# export the cleaned csv
df.to_csv("./data/processed/dataset_clean.csv", index=False)
df.info()
The FordGoBike data is straightforward and intuitive to understand. It has been collected efficiently, providing a valuable dataset to work with and draw conclusions from.
The column header descriptions are as follows (each trip is anonymized):
The following columns have been added during the cleaning task:
Another dataset has been created which is a subset of this dataset. This new dataset contains all the stations details with the following columns:
The main features include data related to the start and end of a ride, which can be used to calculate when bikes are in high or low demand. There is also ample data about the users of the service, such as whether they are paying members or casual users. These features can inform business decisions, such as which groups (for example working professionals or students) to target in marketing campaigns, or on which days of the week the most bikes should be available.
The start and end times, and the related time-based information, will be utilized heavily, along with trip-related data such as distances, to understand each trip, its duration and usage.
Now that the cleaning part is done, we can start to visualize the data.
df = pd.read_csv("./data/processed/dataset_clean.csv")
df.info()
for col in ["start_time", "end_time"]:
    df[col] = pd.to_datetime(df[col])
for col in ["start_station_id", "end_station_id"]:
    df[col] = df[col].astype('int')
for col in ["start_station_id", "end_station_id", "bike_id"]:
    df[col] = df[col].astype('str')
for col in ["user_type", "bike_share_for_all_trip"]:
    df[col] = df[col].astype('category')
df.info()
df.head()
df_station_names = pd.read_csv("./data/processed/df_station_names.csv")
df_station_names.head()
Before we start visualizing, we can extract some additional information from the data to improve the insights. We saw that the stations separate into three clusters. Since these clusters are quite obvious, we can classify them efficiently using the k-means clustering algorithm.
kmeans = KMeans(n_clusters=3).fit(
df_station_names[["station_longitude", "station_latitude"]])
df_station_names["label"] = kmeans.labels_
for x in set(df_station_names.label):
    df_plot_cluster = df_station_names.query(f"label == {x}")
    plt.scatter(df_plot_cluster['station_longitude'],
                df_plot_cluster["station_latitude"], s=50, alpha=0.5, label=x)
# plt.xlim(-122.5, -121.8)
# plt.ylim(37.2, 37.9)
plt.legend()
Now that this is done, we can visualize the data on a map. This could be done with Bokeh or Plotly, but we will use kepler.gl from the visualization toolbox for map visualizations. It is an incredibly easy tool for creating map-based visualizations.
Source: kepler.gl
Now we will map the labels with the original names.
# note: k-means assigns cluster numbers arbitrarily across runs - verify this mapping against the plot above
mapping = {0: "San Francisco", 1: "San José", 2: "East Bay"}
df_station_names["label_name"] = df_station_names["label"].map(mapping)
df_station_names.head()
df_station_names.label_name.value_counts()
# duplicates for new coordinates which are not > 100 m away from the origin
df_station_names[df_station_names.duplicated("new_id", keep=False)]
df_station_names.drop_duplicates(subset=["new_id"], inplace=True)
df_station_names.info()
df[df.duplicated(subset=["start_time", "end_time"], keep=False)]
df.info()
Now we can merge the labels to the id's of the main dataframe.
df = df.merge(df_station_names[["new_id", "label"]],
left_on="start_station_id_new", right_on="new_id", how="outer")
df.info()
df[df.label.isnull()].start_station_id_new.value_counts()
df.query("start_station_id == '205'").drop_duplicates("start_station_id_new")[
["start_station_id", "start_station_name", "start_station_latitude", "start_station_longitude", "start_station_id_new", "label"]]
df["label_name"] = df["label"].map(mapping)
df.label_name.value_counts()
for x in set(df.label):
    df_plot_cluster = df.query(f"label == {x}")
    plt.scatter(df_plot_cluster['start_station_longitude'],
                df_plot_cluster["start_station_latitude"], s=20, alpha=0.5, label=x)
plt.legend()
Now we can extract further information from the start_time timestamp.
df['month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('M')
df['day_month_year'] = pd.to_datetime(df["start_time"]).dt.to_period('D')
df["dayofweek"] = df["start_time"].dt.dayofweek
df["start_hr"] = df["start_time"].dt.hour
df["end_hr"] = df["end_time"].dt.hour
df.head()
df.info()
There are multiple interesting variables in this dataset. Let's start with the stations first.
value_ct = df.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize=(22, 5), dpi=80)
color = sns.color_palette("viridis")[1]
sns.countplot(x="start_station_id_new", data=df,
order=value_ct.index, color=color)
plt.ylim(0, 70000)
cur_axes = plt.gca()
# cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()),
                (p.get_x()+0.08, p.get_height()-4000), color="white")
plt.title("Top 25 Start Stations")
plt.xlabel("Start Station ID")
Overall we can see that 67_0, 15_0 and 58_0 are the most used stations in this dataset. Let's take a look at each group separately.
for x in value_ct.index:
    print(x + " - " +
          str(df.query(f"start_station_id_new == '{x}'").start_station_name.drop_duplicates().to_numpy()))
df_new = df.query("label == 0").copy()
value_ct = df_new.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize=(22, 5), dpi=80)
color = sns.color_palette("viridis")[1]
sns.countplot(x="start_station_id_new", data=df_new,
order=value_ct.index, color=color)
plt.ylim(0, 50000)
cur_axes = plt.gca()
# cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()),
                (p.get_x()+0.08, p.get_height()-2000), color="white")
plt.title("Top 25 Start Stations for San Francisco")
plt.xlabel("Start Station ID")
We can see that San Francisco leads the overall trip count.
df_new = df.query("label == 1").copy()
value_ct = df_new.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize=(22, 5), dpi=80)
color = sns.color_palette("viridis")[1]
sns.countplot(x="start_station_id_new", data=df_new,
order=value_ct.index, color=color)
plt.ylim(0, 50000)
cur_axes = plt.gca()
# cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()),
                (p.get_x()+0.08, p.get_height()-2000), color="white")
plt.title("Top 25 Start Stations for San José")
plt.xlabel("Start Station ID")
df_new = df.query("label == 2").copy()
value_ct = df_new.start_station_id_new.value_counts().iloc[:25]
fig, ax = plt.subplots(figsize=(22, 5), dpi=80)
color = sns.color_palette("viridis")[1]
sns.countplot(x="start_station_id_new", data=df_new,
order=value_ct.index, color=color)
plt.ylim(0, 50000)
cur_axes = plt.gca()
# cur_axes.axes.get_xaxis().set_visible(False)
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()),
                (p.get_x()+0.08, p.get_height()-2000), color="white")
plt.title("Top 25 Start Stations for East Bay")
plt.xlabel("Start Station ID")
It seems the Ford GoBike program is relatively new in East Bay and San José; these areas have fewer trips than San Francisco. In East Bay the stations 182_0, 243_0 and 176_0 are popular; for San José the top three are 310_0, 296_0 and 312_0. Now that we know San Francisco is the city with the most active users in this project, let's take a look at trip duration.
# maybe a customer forgot to log off
bin_size = 100
bins = np.arange(0, df.duration_sec.max()+bin_size, bin_size)
fig, axes = plt.subplots(figsize=(12, 5), dpi=110)
plt.hist(df.duration_sec, bins=bins, color=color)
plt.xticks(ticks=[x for x in range(0, 7000, 250)])
plt.xlim(-100, 6000)
plt.tight_layout()
df.query("duration_sec < 6000").duration_sec.mean()
fig, axes = plt.subplots(figsize=(12, 5), dpi=110)
for x in mapping.values():
    df_new = df.query(f"label_name == '{x}'")
    bin_size = 100
    bins = np.arange(0, df_new.duration_sec.max()+bin_size, bin_size)
    plt.hist(df_new.duration_sec, bins=bins, label=x, histtype='step')
plt.xticks(ticks=[x for x in range(0, 7000, 250)])
plt.legend()
plt.xlim(-100, 6000)
plt.tight_layout()
All three distributions look similar (right-skewed), although trips in East Bay appear to be a little shorter in duration.
for x in mapping.values():
    print(x, df.query(
        f"label_name == '{x}' and duration_sec < 6000").duration_sec.mean())
The mean values agree with that.
fig, axes = plt.subplots(figsize=(12, 5), dpi=110)
df_new = df.query("label_name == 'San Francisco'")
bin_size = 100
bins = np.arange(0, df_new.duration_sec.max()+bin_size, bin_size)
plt.hist(df_new.duration_sec, bins=bins, label="San Francisco", histtype='step', color="g")
plt.xticks(ticks=[x for x in range(0, 7000, 250)])
plt.legend()
plt.xlim(-100, 6000)
plt.tight_layout()
fig, axes = plt.subplots(figsize=(12, 5), dpi=110)
df_new = df.query("label_name == 'East Bay'")
bin_size = 100
bins = np.arange(0, df_new.duration_sec.max()+bin_size, bin_size)
plt.hist(df_new.duration_sec, bins=bins, label="East Bay", histtype='step')
plt.xticks(ticks=[x for x in range(0, 7000, 250)])
plt.legend()
plt.xlim(-100, 6000)
plt.tight_layout()
fig, axes = plt.subplots(figsize=(12, 5), dpi=110)
df_new = df.query("label_name == 'San José'")
bin_size = 100
bins = np.arange(0, df_new.duration_sec.max()+bin_size, bin_size)
plt.hist(df_new.duration_sec, bins=bins, label="San José",
         histtype='step', color="orange")
plt.xticks(ticks=[x for x in range(0, 7000, 250)])
plt.legend()
plt.xlim(-100, 6000)
plt.tight_layout()
Let's look at the user structure.
value_ct = df.user_type.value_counts().iloc[:31]
fig, ax = plt.subplots(figsize=(10, 5), dpi=80)
sns.countplot(x="user_type", data=df, order=value_ct.index, palette="viridis")
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()),
                (p.get_x()+0.31, p.get_height()+40000))
plt.title("Users By Type")
plt.xlabel("")
value_ct = df.label_name.value_counts().iloc[:31]
fig, ax = plt.subplots(figsize=(10, 5), dpi=80)
sns.countplot(x="label_name", data=df, order=value_ct.index, hue="user_type")
cur_axes = plt.gca()
cur_axes.axes.get_yaxis().set_visible(False)
sns.despine(fig, left=True)
for p in ax.patches:
    ax.annotate('{:.0f}'.format(p.get_height()),
                (p.get_x()+0.08, p.get_height()+40000))
plt.title("Users By Type")
plt.xlabel("")
There seem to be far more Subscribers than Customers using the service.
For the next plots we will focus on the time components of the data. First we will explore on which days people like to go on trips; 0 refers to Monday while 6 refers to Sunday.
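The 0-6 numbering can be sanity-checked with a small sketch (the sample dates below are arbitrary, chosen as a known Monday and Saturday):

```python
import pandas as pd

day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# a Monday (2019-01-07) and a Saturday (2019-01-12) to verify the 0-6 numbering
samples = pd.to_datetime(pd.Series(["2019-01-07", "2019-01-12"]))
labels = samples.dt.dayofweek.map(dict(enumerate(day_names)))
```

The same mapping could be used to put weekday names on the countplot ticks instead of the raw numbers.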
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="dayofweek", data=df)
Based on this graph it looks like users use the bikes more during the week than during the weekend. Tuesdays and Thursdays seem to be the most popular days for using the bike sharing system, however Wednesdays, Mondays and Fridays are very close to the Tues/Thurs numbers. The usage drops significantly on Saturdays and Sundays suggesting that the sharing service is used primarily for commuting purposes during working days.
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="label_name", data=df, hue="dayofweek")
for x in mapping.values():
    day_share = df.query(f"label_name == '{x}'").dayofweek.value_counts().sort_index()
    print(f"{x}\n", day_share / day_share.sum(), "\n")
This applies to all three areas: Tuesday, Wednesday and Thursday are the most active days, followed by Monday and Friday, and then Saturday and Sunday. Now let's look at the trips per month/year.
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="month_year", data=df)
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="month_year", data=df, hue="label_name")
For all three areas we can see that bike sharing drops during November-December 2018, with the highest usage in March-April 2019.
fig, ax = plt.subplots(figsize=(20, 5))
sns.countplot(x="label_name", data=df, hue="month_year")
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
fig, ax = plt.subplots(figsize=(20, 5))
sns.countplot(x="month_year", data=df.query(
"label_name == 'San José'"))
The drop in the San José area persisted for one more month, until January, after which usage rises again.
fig, ax = plt.subplots(figsize=(20, 5))
sns.countplot(x="month_year", data=df.query(
"label_name == 'East Bay'"))
In East Bay, service usage increases after December 2018 and peaks in March and April.
# assigning colors to day of the week (viridis for Mon-Sat, a lighter yellow for Sun)
week_colors = list(sns.color_palette("viridis")) + [(224/255, 228/255, 65/255)]
custom_palette = {q: week_colors[q.dayofweek] for q in set(df.day_month_year)}
# creating legend handles for the weekday colors
legend_obj = []
colors = list(sns.color_palette("viridis")) + [(224/255, 228/255, 65/255)]
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for i, s in enumerate(days):
    legend_obj.append(plt.scatter([], [], color=colors[i]))
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="start_hr", data=df, ax=ax)
The most frequent starting hours are 8:00 and 17:00. Maybe people use the bikes before and after work, which would make sense given the many subscribers of working age in our dataset: you only subscribe to something if you want to use it regularly, so integration into working or study life is plausible.
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="start_hr", data=df.query(
"label == 0"), ax=ax)
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="start_hr", data=df.query(
"label == 1"), ax=ax)
For San José the trend leans more towards 17:00.
fig, ax = plt.subplots(figsize=(16, 5))
sns.countplot(x="start_hr", data=df.query(
"label == 2"), ax=ax)
As we saw, the most frequent hours are 8:00 and 17:00. It would be interesting to see how this trend changes over time, for example by connecting the counts at these two hours across months and reading the slope.
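A hedged sketch of tracking the two peak-hour counts across months, using a synthetic mini-sample in place of the real dataframe:

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical mini-sample standing in for the real trip dataframe
trips = pd.DataFrame({"start_time": pd.to_datetime([
    "2018-01-05 08:10", "2018-01-09 17:20", "2018-02-02 08:05",
    "2018-02-14 17:45", "2018-02-20 17:50",
])})
trips["month_year"] = trips["start_time"].dt.to_period("M")
trips["start_hr"] = trips["start_time"].dt.hour

# monthly trip counts for the two peak hours, drawn as connected lines
peak = (trips[trips["start_hr"].isin([8, 17])]
        .groupby(["month_year", "start_hr"]).size().unstack(fill_value=0))
ax = peak.plot(marker="o")
plt.ylabel("trips")
```

On the real data, the slope of each line would show whether the commute peaks grow or shrink over the observation period.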
Thankfully, due to thorough cleaning, there were no big surprises during the exploratory and visualization phase.
Nothing falls outside normal expectations for a bike sharing system in a major city. So far, the data shows that the service is heavily utilized during normal working hours, and the 2018-2019 graphs show more and more people subscribing, which is good. A little fiddling was necessary to plot the day-wise usage of the service, but it was mostly for aesthetic purposes.
Now we are going to dig deeper into the data, searching for relationships and trends between variables.
df.head()
# join the trips together
df["combi"] = df["start_station_id_new"] + " - " + df["end_station_id_new"]
df.combi.value_counts().head(10)
One interesting insight is the set of most frequent trips. From the cell above we can see that station 6_0 appears often. A heatmap should make this visible.
combi_list = df.combi.value_counts().head(15).keys()
df_criteria = df[df["combi"].isin(combi_list)]
df_pivot = df_criteria.pivot_table(
index="start_station_id_new", columns="end_station_id_new", values="start_time", aggfunc="count", fill_value=0)
fig, axes = plt.subplots(figsize=(15, 10), dpi=70)
sns.heatmap(df_pivot, annot=True, cmap="viridis_r",
fmt='g', vmin=1, vmax=6000, mask=df_pivot == 0)
Here we can see the top 15 routes.
for i, x in enumerate(combi_list):
    start, end = x.split(" - ")
    start_name = df.query(
        f"start_station_id_new == '{start}'").start_station_name.drop_duplicates().to_numpy()
    end_name = df.query(
        f"end_station_id_new == '{end}'").end_station_name.drop_duplicates().to_numpy()
    print(start, " ", start_name[0], " --> ", end, " ", end_name[0])
combi_list = df.combi.value_counts().sample(20).keys()
df_criteria = df[df["combi"].isin(combi_list)]
df_pivot = df_criteria.pivot_table(
index="start_station_id_new", columns="end_station_id_new", values="start_time", aggfunc="count", fill_value=0)
fig, axes = plt.subplots(figsize=(15, 10), dpi=80)
sns.heatmap(df_pivot, annot=True, cmap="viridis_r",
fmt='g', vmin=1, vmax=6000, mask=df_pivot == 0)
Another interesting combination is the day of the week together with the average duration. For this we create a groupby object (mean) over each month-year combination. A box plot is appropriate to visualize this relationship.
fig, ax = plt.subplots(figsize=(15, 10), dpi=80)
sns.boxplot(x="dayofweek", y="duration_sec", data=df.groupby(
["dayofweek", "month_year"], as_index=False).mean())
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(legend_obj, days, loc='center left', bbox_to_anchor=(1, 0.5))
We can clearly see that during the week users go on shorter trips (~780 seconds), while the average duration on the weekend rises to ~1200 seconds.
fig, ax = plt.subplots(figsize=(15, 10), dpi=80)
sns.boxplot(x="dayofweek", y="duration_sec", data=df.groupby(
["dayofweek", "month_year", "label_name"], as_index=False).mean(), hue="label_name")
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
This trend applies to all areas. We can also see that San Francisco users have, on average, the longest trips, followed by East Bay and then San José. Next we explore whether the duration also differs by starting hour.
fig, ax = plt.subplots(figsize=(15, 10), dpi=80)
sns.boxplot(x="start_hr", y="duration_sec", data=df.groupby(
["start_hr", "month_year"], as_index=False).mean())
For hours 0-3 we may be dealing with outliers; on the other hand, we saw earlier that few trips start at that time, so long trips have a stronger impact than at, say, 8 o'clock. From 5-9 trips are relatively short at ~600 seconds, then the average rises to ~950 from 10-15. From 16-20 it drops again to ~700 seconds, before increasing slightly around 22 and 23.
fig, ax = plt.subplots(figsize=(20, 10), dpi=80)
sns.boxplot(x="start_hr", y="duration_sec", data=df.groupby(
["start_hr", "month_year", "label_name"], as_index=False).mean(), hue="label_name")
Here the hours 0-4 become noisier. The graph confirms again that San Francisco has, on average, the longest trips, while users in East Bay and San José tend to take shorter trips.
fig, ax = plt.subplots(figsize=(15, 10), dpi=80)
sns.boxplot(x="dayofweek", y="start_hr", data=df.groupby(
["dayofweek", "month_year"], as_index=False).mean())
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Put a legend to the right of the current axis
ax.legend(legend_obj, days, loc='center left',
bbox_to_anchor=(1, 0.5))
This graph shows, that people start their trips, on average, later on the weekend than during the week.
fig, ax = plt.subplots(figsize=(15, 10), dpi=80)
sns.boxplot(x="dayofweek", y="start_hr", data=df.groupby(
["dayofweek", "month_year", "label_name"], as_index=False).mean(), hue="label_name")
Looking at each area is interesting: users from East Bay and San José not only have shorter trip durations on average, but also start their trips later than users in San Francisco.
fig, ax = plt.subplots(figsize=(15, 10), dpi=80)
sns.boxplot(x="month_year", y="start_hr", data=df.groupby(
["month_year", "dayofweek"], as_index=False).mean())
This shows that, on average, users start later in the middle of the year than at the beginning or end of the year.
fig, ax = plt.subplots(figsize=(15, 10), dpi=80)
sns.boxplot(x="month_year", y="start_hr", data=df.groupby(
["month_year", "dayofweek", "label_name"], as_index=False).mean(), hue="label_name")
We see that more trips are taken during the week, starting and ending at very predictable times, which suggests people use the service to commute to work. As the seasons change over the year, work schedules shift and the trip start times shift with them, which shows clearly in the plots. The number of trips increased at the start of 2019, which could mean people took up cycling for better health (a new year's resolution, maybe?).
Trips starting at around 0:00-4:00 and their durations point to potential outliers. This was particularly interesting: normal working hours start around 8:00-9:00 in the morning and end around 15:00-16:00 in the evening, but some trips started well before the morning and lasted much longer. This may be because people work out in the morning, and riding a bike is a pretty good workout, but we have no data on this, so we cannot verify it.
For the last exploratory visualization we will take a look at the trips with kepler.gl.
df_station_names.head()
df_trips = df[["start_station_latitude", "start_station_longitude", "end_station_latitude",
"end_station_longitude", "start_station_id_new", "end_station_id_new"]].copy()
df_trips["cnt"] = 1
df_trips_grp = df_trips.groupby(
[x for x in df_trips.columns[:-1]], as_index=False).sum().sort_values("cnt", ascending=False)
df_trips_grp.head()
df_trips_grp.to_csv("./data/processed/grps.csv", index=False)
At first we will look at San Francisco trips:
We can see that most of the trips are close to the beach.